class: center, middle, inverse, title-slide

# APSTA-GE 2003: Intermediate Quantitative Methods
## Lab Section 003, Week 2
### New York University
### 09/15/2020

---

## Reminders

- Assignment 1
  - Due: **09/23/2020 11:55pm (EST)**
- Assignment 2 is available now
  - Due: **10/02/2020 11:55pm (EST)**
- Office hours
  - Monday 9 - 10am (EST)
  - Wednesday 12:30 - 1:30pm (EST)
- Office hour notes
  - Available on NYU Classes under the "Resources" tab

---

## Fun slide

**"Get ten bagels when you arrive at the bakery. If you see muffins, get one."**

How many bagels will you buy?

- A. 10
- B. 1

---

## Another fun slide

**"Get ten bagels when you arrive at the bakery. If you see muffins, get one."**

```r
has_Muffin <- TRUE

shopping_at <- function(location) {
  if (location == "bakery") {
    if (has_Muffin == TRUE) {
      # A literal reader takes "get one" to mean one bagel
      bagel <- 1
    } else {
      bagel <- 10
    }
  }
  return(bagel)
}

shopping_at(location = "bakery")
```

```
## [1] 1
```

---

## One more fun slide

```r
has_Muffin <- TRUE

shopping_at <- function(human = TRUE, location) {
  if (location == "bakery") {
*   if (human == TRUE) {
      if (has_Muffin == TRUE) {
        bagel <- 10
        muffin <- 1
      } else {
        bagel <- 10
      }
*   } else {
      if (has_Muffin == TRUE) {
        bagel <- 1
      } else {
        bagel <- 10
      }
    }
  }
  return(bagel)
}

shopping_at(human = TRUE, location = "bakery")  # 10
shopping_at(human = FALSE, location = "bakery") # 1
```

---

## How about another fun slide?

To remove the ambiguity, phrase the instruction like this:

When you arrive at the bakery:

- If there are muffins, buy 1 muffin and 10 bagels. Checkout.
- If there is no muffin, buy 10 bagels. Checkout.

---

## So many fun slides

How we communicate:

- "Get some bagels. Maybe we should get some muffins. I don't know. If muffins look good, get some. Oh! Don't forget brownies. If you buy all three, get less bagels cause that will be too much."

Human: totally fine! don't worry, I got it!

Computer:

- error: "some" is not defined
- logic error: "maybe"
- logic error: "I don't know, either."
- error: "good" is not defined - error: "less" is not defined - logic warning: "If you buy all three..." --- ## Today's topics - Math Review - Ordinary Least Squares Linear Regression - Programming in R - Simple regression analysis from scratch --- class: inverse, middle, center # Math Review --- ## Linear Regression There are many types of regression analysis. In this course, we will focus on linear regression. `$$Y_i = \beta_0 + \beta_1 \times X_i + \varepsilon$$` where `\(\beta_0\)` is the intercept, `\(\beta_1\)` is the slope, and `\(\varepsilon\)` is the error. --- ## We need some data Let's create 100 data points with two variables: `X` and `Y`, which are highly correlated. $$ Y_i = X_i + \varepsilon $$ ```r epsilon <- rnorm(n = 100, mean = 0, sd = 1) dat_sim <- data.frame( X <- rnorm(n = 100, mean = 20, sd = 2), Y <- X + epsilon, diff = Y - X ) colnames(dat_sim) <- c("X", "Y", "Diff.") *head(dat_sim) ``` ``` ## X Y Diff. ## 1 17.55815 18.06136 0.5032112 ## 2 17.93834 17.03734 -0.9009984 ## 3 22.20215 23.67810 1.4759490 ## 4 19.12706 19.35034 0.2232747 ## 5 22.18025 21.93926 -0.2409937 ## 6 20.12013 20.38085 0.2607256 ``` --- ## Let's visualize the simulated data set
```
## [1] "Correlation Coefficient: 0.874"
```

---

## We need a model

Let's fit a linear regression model. A linear regression defines a line:

`$$\hat{Y}_i = \beta_0 + \beta_1 X_i$$`
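---

## Try the line in R

As a quick sketch of what fitting that line looks like in R (not part of the original lab code; it re-simulates `X` and `Y` as on the earlier slide, with `set.seed(1)` added so the numbers are reproducible):

```r
# Re-simulate data as on the earlier slide (seed added for reproducibility)
set.seed(1)
X <- rnorm(n = 100, mean = 20, sd = 2)
Y <- X + rnorm(n = 100, mean = 0, sd = 1)
dat_sim <- data.frame(X = X, Y = Y)

# lm() estimates the line Y-hat = beta0 + beta1 * X
fit <- lm(Y ~ X, data = dat_sim)
coef(fit)
```

Since `Y` was generated as `X` plus noise, the estimated intercept should land near 0 and the slope near 1.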
---

## What does this mean?

The **regression line** is the line that minimizes the **squared distances/errors** between `\(Y_i\)` and the line. In this case:

`$$(\beta_0, \beta_1) = \underset{\beta_0, \beta_1}{\operatorname{arg\,min}} \sum_{i=1}^{100} (Y_i - \hat{Y}_i)^2 = \underset{\beta_0, \beta_1}{\operatorname{arg\,min}} \sum_{i=1}^{100} (Y_i - (\beta_0 + \beta_1 X_i))^2$$`
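---

## The argmin, numerically

That argmin can also be found by brute force. A minimal sketch (an added illustration, assuming simulated `X` and `Y` as above): hand the sum of squared errors to R's general-purpose optimizer `optim()` and compare with `lm()`.

```r
set.seed(1)
X <- rnorm(n = 100, mean = 20, sd = 2)
Y <- X + rnorm(n = 100, mean = 0, sd = 1)

# Q(beta0, beta1): sum of squared errors for a candidate line
Q <- function(beta) sum((Y - (beta[1] + beta[2] * X))^2)

# Numerically search for the minimizing (beta0, beta1)
opt <- optim(par = c(0, 0), fn = Q, method = "BFGS")

opt$par          # numeric argmin
coef(lm(Y ~ X))  # agrees up to optimizer tolerance
```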
---

## We love math!

`$$\begin{align} \mathcal{Q}(\beta_0, \beta_1) & = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \\ & = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2 \\ & = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 \\ & = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i \end{align}$$`

---

## We love math so much!

`$$\mathcal{Q}(\beta_0, \beta_1) = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i$$`

`$$\frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X}$$`

`$$\frac{\partial \mathcal{Q}}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i$$`

where `\(\overline{Y}\)` is the mean of `\(Y\)`, and `\(\overline{X}\)` is the mean of `\(X\)`.

---

## We love math! (Yes we do!)

`\(\frac{\partial \mathcal{Q}}{\partial \beta_0}\)` is the rate at which `\(\mathcal{Q}\)` changes when the intercept (`\(\beta_0\)`) is varied and the slope (`\(\beta_1\)`) is held constant.

`\(\frac{\partial \mathcal{Q}}{\partial \beta_1}\)` is the rate at which `\(\mathcal{Q}\)` changes when the slope (`\(\beta_1\)`) is varied and the intercept (`\(\beta_0\)`) is held constant.

---

## Almost there ...

To get the Ordinary **Least** Squares (OLS) estimates, we need to confirm that the critical point is a minimum rather than a maximum or a saddle point. That is, we check that the second derivatives are positive.
`$$\frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n \beta_0 - 2n \overline{Y} + 2n \beta_1 \overline{X}$$`

`$$\frac{\partial^2 \mathcal{Q}}{\partial \beta_0^2} = 2n > 0$$`

`$$\frac{\partial \mathcal{Q}}{\partial \beta_1} = 2 \beta_1 \sum_{i=1}^n X_i^2 + 2n \beta_0 \overline{X} - 2 \sum_{i=1}^n Y_i X_i$$`

`$$\frac{\partial^2 \mathcal{Q}}{\partial \beta_1^2} = 2 \sum_{i=1}^n X_i^2 > 0$$`

`$$\frac{\partial^2 \mathcal{Q}}{\partial \beta_0 \partial \beta_1} = 2 \sum_{i=1}^n X_i = 2n \overline{X}$$`

---

## Final Math slide I promise

Then, set `\(\frac{\partial \mathcal{Q}}{\partial \beta_0}\)` and `\(\frac{\partial \mathcal{Q}}{\partial \beta_1}\)` to zero and solve:

`$$\frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} = 0$$`

`$$\frac{\partial \mathcal{Q}}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i = 0$$`

This gives:

`$$\beta_0 = \overline{Y} - \beta_1 \overline{X}$$`

`$$\beta_1 = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})} {\sum_{i=1}^n (X_i - \overline{X})^2} = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})} {n \times Var.(X_i)} = \frac{Corr.(X, Y) \times S.D.(Y)}{S.D.(X)}$$`

Holding the standard deviations fixed: the higher the correlation, the steeper the slope.

---

## Correlation and slope
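---

## Check the formulas in R

The closed-form solutions can be verified directly in R, a first taste of "simple regression analysis from scratch." A small sketch (an added illustration, assuming simulated `X` and `Y` as in the earlier slides; seed added for reproducibility):

```r
set.seed(1)
X <- rnorm(n = 100, mean = 20, sd = 2)
Y <- X + rnorm(n = 100, mean = 0, sd = 1)

# Slope: cross-deviations over squared deviations of X
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)

# Equivalent correlation form: Corr(X, Y) * SD(Y) / SD(X)
b1_alt <- cor(X, Y) * sd(Y) / sd(X)

# Intercept: mean(Y) - b1 * mean(X)
b0 <- mean(Y) - b1 * mean(X)

c(b0 = b0, b1 = b1, b1_alt = b1_alt)
coef(lm(Y ~ X))  # same estimates
```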
---

class: inverse, center, middle

# Programming in R

---

## Getting the data set

[2017 Cherry Blossom Run Data](https://www.openintro.org/data/index.php?data=run17)

[Download Here](https://www.openintro.org/data/csv/run17.csv)

Available on NYU Classes under "Resources - Tong's Lab - Lab 3"

---

## The Data Set

Number of rows: 19,961

Variables

- bib: number on the runner's bib (ID)
- name: initials
- sex: biological sex
- age
- city: home city
- net_sec: time record in seconds (accounting for staggered starting time)
- clock_sec: time record in seconds (based on the clock)
- pace_sec: average time per mile, in seconds
- event: either the "10 Mile" race or the "5K"

---

## Import the data set to RStudio

```r
# Set working directory

# Load the data set
dat <- read.csv("run17.csv")
```

---

## Check dimensions

**Dimensions:** number of rows, number of columns

```r
# Count number of rows and save as N
*N <- nrow(dat)
```

```r
# Count number of columns
*ncol(dat)

# Check column types
*str(dat)
```

---

## Descriptive Analysis

```r
summary(dat)
```

```
##       bib                 name       sex            age       
##  Min.   :    3   Michael M. :   30   F:12267   Min.   : 6.00  
##  1st Qu.: 6078   Jennifer S.:   28   M: 7694   1st Qu.:29.00  
##  Median :12128   John M.    :   26             Median :35.00  
##  Mean   :12300   Michael C. :   25             Mean   :37.13  
##  3rd Qu.:18321   Stephanie S.:  24             3rd Qu.:44.00  
##  Max.   :25497   John S.    :   23             Max.   :92.00  
##                  (Other)    :19805             NA's   :1      
##                 city          net_sec       clock_sec       pace_sec      
##  Washington, DC   : 4861   Min.   :1078   Min.   :1078   Min.   : 279.0  
##  Arlington, VA    : 1789   1st Qu.:4767   1st Qu.:5048   1st Qu.: 524.0  
##  Alexandria, VA   :  921   Median :5658   Median :6333   Median : 596.0  
##  Silver Spring, MD:  534   Mean   :5428   Mean   :6083   Mean   : 608.7  
##  New York, NY     :  434   3rd Qu.:6433   3rd Qu.:7439   3rd Qu.: 674.0  
##  Bethesda, MD     :  390   Max.   :8398   Max.   :9896   Max.   :1759.0  
##  (Other)          :11032   NA's   :3      NA's   :3      NA's   :3       
##      event      
##  10 Mile:17442  
##  5K     : 2519  
```

```r
# For the net_sec column
# mean(dat$net_sec)
# var(dat$net_sec)
# sd(dat$net_sec)

*sd(dat$net_sec) / sqrt(N) # Standard error: NA because net_sec has 3 missing values
# sd(dat$net_sec, na.rm = TRUE) / sqrt(sum(!is.na(dat$net_sec))) # NA-safe version
```

```
## [1] NA
```

---

## Hypotheses

`\(H_0\)`: For the 10 Mile race, the mean net finish time of male runners is the same as that of female runners.

`\(H_1\)`: For the 10 Mile race, the mean net finish time of male runners is **not** the same as that of female runners.

---

## Subset the data set

```r
dat_10m <- dat[dat$event == "10 Mile", ]
dat_10m$sex <- ifelse(dat_10m$sex == "M", 1, 0)
```

---

## Levene's test

To examine whether a pooled variance is suitable.

```r
library(car) # provides leveneTest()

leveneTest(net_sec ~ factor(sex), data = dat_10m)
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
*## group     1  34.267 4.891e-09 ***
##       17440                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## T-test

```r
t.test(net_sec ~ sex, data = dat_10m, var.equal = FALSE)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  net_sec by sex
*## t = 40.434, df = 14287, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  582.1310 641.4466
## sample estimates:
## mean in group 0 mean in group 1 
##        6113.618        5501.829
```

---

## Linear regression

```r
summary(lm(net_sec ~ sex, data = dat_10m))
```

```
## 
## Call:
## lm(formula = net_sec ~ sex, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2896.62  -687.62   -38.62   640.33  2896.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6113.618      9.454  646.69   <2e-16 ***
## sex         -611.789     14.927  -40.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
## 
## Residual standard error: 966.2 on 17440 degrees of freedom
## Multiple R-squared:  0.08786, Adjusted R-squared:  0.0878 
## F-statistic:  1680 on 1 and 17440 DF,  p-value: < 2.2e-16
```

---

## Visualize regression results

```r
*plot(x = dat_10m$sex, y = dat_10m$net_sec, type = "p")
abline(a = 6113.618, b = -611.789, lwd = 2)
```

<!-- -->

---

## Let's do another regression analysis

**On average, the older, the slower.**

```r
summary(lm2 <- lm(net_sec ~ age, data = dat_10m))
```

```
## 
## Call:
## lm(formula = net_sec ~ age, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5388.0324    27.2989  197.37   <2e-16 ***
## age           12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
```

---

## Visualize regression results

```r
*plot(x = dat_10m$age, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5388.0324, b = 12.9782, lwd = 2)
```

<!-- -->

---

## De-mean

```r
*dat_10m$age0 <- dat_10m$age - mean(dat_10m$age)

summary(lm3 <- lm(net_sec ~ age0, data = dat_10m))
```

```
## 
## Call:
## lm(formula = net_sec ~ age0, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.2292     7.5877  773.38   <2e-16 ***
## age0          12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
```

---

## Visualize regression results

```r
plot(x = dat_10m$age0, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5868.2292, b = 12.9782, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")
```

<!-- -->

---

## Standardize

```r
dat_10m$age1 <- dat_10m$age0 / sd(dat_10m$age0)

summary(lm4 <- lm(net_sec ~ age1, data = dat_10m))
```

```
## 
## Call:
## lm(formula = net_sec ~ age1, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.229      7.588  773.38   <2e-16 ***
## age1         138.950      7.588   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
```

---

## Visualize regression results

```r
plot(x = dat_10m$age1, y = dat_10m$net_sec, type = "p", col = "grey", xlab = "SD")
abline(a = 5868.229, b = 138.950, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")
```

<!-- -->
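---

## One last connection

Tying the regression output back to the math slides: with a standardized predictor, the slope equals `Corr(X, Y) * SD(Y)`, since `SD` of the standardized `X` is 1. A sketch with simulated stand-in data (hypothetical numbers chosen to resemble the run data, not the actual run17 records):

```r
set.seed(1)
age <- rnorm(n = 500, mean = 37, sd = 10.7)
net_sec <- 5388 + 13 * age + rnorm(n = 500, mean = 0, sd = 1000)

# Standardize the predictor, as on the slide above
age1 <- (age - mean(age)) / sd(age)
fit <- lm(net_sec ~ age1)

# Slope on the standardized predictor equals Corr(X, Y) * SD(Y)
c(coef(fit)[["age1"]], cor(age, net_sec) * sd(net_sec))
```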